Revisiting Aggregation for Data Intensive Applications: A Performance Study

Authors

  • Jian Wen
  • Vinayak R. Borkar
  • Michael J. Carey
  • Vassilis J. Tsotras
Abstract

Aggregation has been an important operation since the early days of relational databases. Today’s Big Data applications bring further challenges when processing aggregation queries, demanding adaptive aggregation algorithms that can process large volumes of data relative to a potentially limited memory budget (especially in multiuser settings). Despite its importance, the design and evaluation of aggregation algorithms has not received the same attention that other basic operators, such as joins, have received in the literature. As a result, when considering which aggregation algorithm(s) to implement in a new parallel Big Data processing platform (AsterixDB), we faced a lack of “off the shelf” answers that we could simply read about and then implement based on prior performance studies. In this paper we revisit the engineering of efficient local aggregation algorithms for use in Big Data platforms. We discuss the salient implementation details of several candidate algorithms and present an in-depth experimental performance study to guide future Big Data engine developers. We show that the efficient implementation of the aggregation operator for a Big Data platform is non-trivial and that many factors, including memory usage, spilling strategy, and I/O and CPU cost, should be considered. Further, we introduce precise cost models that can help in choosing an appropriate algorithm based on input parameters including memory budget, grouping key cardinality, and data skew.

Similar resources

Data Replication-Based Scheduling in Cloud Computing Environment

Abstract— High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...

An input-oriented radial measure for returns to scale aggregation.

In production theory, it is necessary to be able to predict the production function’s long-run behavior. Here, returns to scale is a helpful concept. Returns to scale describes the reaction of a production function to the proportional scaling of all its input variables. In this regard, data envelopment analysis (DEA) provides a comprehensive framework for returns to scale evaluation...

Advanced network monitoring applications based on mobile/intelligent agent technology

Mobile Agents (MA) have been proposed as a solution for distributed Network Management (NM). However, most MA-based infrastructures exhibit scalability limitations when data intensive management applications are considered. Therefore, we present three novel applications, tailored to transfers of bulk network monitoring data, in which MAs are used to perform data aggregation, acquire atomic SNMP...

A Temporal Aggregation Method for Update-Intensive Applications

The temporal aggregation in temporal databases is an extension of the conventional aggregation including the time as a range condition of the aggregation. In this paper, we propose a new tree based structure for the temporal aggregation, called the CTA-tree. In the CTA-tree, we transform the time interval of a temporal data record into a value, called the T-value, using the Corner transformatio...

Journal title:
  • CoRR

Volume abs/1311.0059

Publication date: 2013